# Multimodal Alignment

## HermesFlow

**Gen-Verse · Apache-2.0 · Image-to-Text · 218 downloads · 4 likes**

HermesFlow is a universal alignment framework for multimodal large language models that autonomously generates homologous preference data. Through self-play iterative optimization and paired DPO, it narrows the gap between multimodal understanding and generation.
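The paired-DPO objective mentioned above scores a preferred ("chosen") response against a dispreferred ("rejected") one, each measured relative to a frozen reference model. A minimal sketch of the standard DPO loss follows; the function name and the example log-probabilities are illustrative, not HermesFlow's actual implementation:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Maximizes the margin between the chosen and rejected responses,
    each measured as a log-ratio against a frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logits)): small when the chosen margin is large
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Illustrative numbers: the policy prefers the chosen response more
# than the reference does, so the loss drops below log(2) ~ 0.693.
loss = dpo_loss(-10.0, -12.0, -10.5, -11.0)
```

When the policy and reference agree exactly, the loss sits at log 2; it decreases as the policy widens the chosen-vs-rejected margin.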
## ViT SO400M Patch14 SigLIP 224 (WebLI)

**timm · Apache-2.0 · Image Classification · Transformers · 123 downloads · 1 like**

A Vision Transformer based on SigLIP, comprising only the image encoder and retaining the original attention-pooling head.
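SigLIP's distinguishing feature is its training objective: instead of CLIP's batch-wide softmax contrastive loss, every image-text pair in the batch is scored independently with a sigmoid (matched pairs labeled +1, all other combinations -1). A minimal numpy sketch of that loss, with made-up temperature and bias values rather than the trained ones:

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over an image-text batch (SigLIP-style).

    Each of the batch*batch pairs is scored independently; the
    diagonal (matched pairs) gets label +1, everything else -1.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b          # (batch, batch) pair scores
    labels = 2 * np.eye(len(img)) - 1     # +1 on diagonal, -1 elsewhere
    # -log sigmoid(label * logit) = log(1 + exp(-label * logit))
    return np.mean(np.log1p(np.exp(-labels * logits)))
```

With perfectly aligned embeddings the loss is small; shuffling the pairing drives it up, since matched pairs then score like negatives.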
## AA Chameleon 7B Plus

**PKU-Alignment · Text-to-Image · Transformers · English · 34 downloads · 5 likes**

A text-image interleaved input-output model, deeply aligned with the Align-Anything algorithm to improve image generation quality and human preference alignment.
## HPT Base

**liruiw · Multimodal Alignment · Transformers · 70 downloads · 10 likes**

HPT is a transformer that aligns different entities into a shared latent space, with a focus on scaling behavior in policy learning.
## LanguageBind Video Huge V1.5 FT

**LanguageBind · MIT · Multimodal Alignment · Transformers · 2,711 downloads · 4 likes**

LanguageBind is a pretrained model that achieves multimodal semantic alignment through language, binding modalities such as video, audio, depth, and thermal imaging to language for cross-modal understanding and retrieval.
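The cross-modal retrieval these models enable reduces to nearest-neighbor search in the shared language-aligned embedding space: encode the query (say, a sentence) and the gallery (say, videos), then rank by cosine similarity. A minimal sketch with made-up 2-D embeddings standing in for the model's actual outputs:

```python
import numpy as np

def retrieve(query_emb, gallery_embs, top_k=1):
    """Rank gallery items by cosine similarity to a query embedding.

    In a language-bound space, the query might be a text embedding
    and the gallery video/audio/depth embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity per gallery item
    order = np.argsort(-sims)     # best match first
    return order[:top_k], sims[order[:top_k]]

# Toy gallery of three "video" embeddings; the "text" query points
# between items 1 and 2, closest to item 2.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.6, 0.8])
idx, scores = retrieve(query, gallery, top_k=2)
```

The same routine works in either direction (text-to-video or video-to-text), since both sides live in one space.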
## LanguageBind Audio FT

**LanguageBind · MIT · Multimodal Alignment · Transformers · 12.59k downloads · 1 like**

LanguageBind is a language-centric multimodal pretraining method that achieves semantic alignment by using language as the bridge between different modalities.
## LanguageBind Video FT

**LanguageBind · MIT · Multimodal Alignment · Transformers · 22.97k downloads · 4 likes**

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities, achieving semantic alignment across video, infrared, depth, audio, and more.
## LanguageBind Video Merge

**LanguageBind · MIT · Multimodal Alignment · Transformers · 10.96k downloads · 4 likes**

LanguageBind is a multimodal model that extends video-language pretraining to N modalities through language-based semantic alignment; accepted at ICLR 2024.
## LanguageBind Image

**LanguageBind · MIT · Multimodal Alignment · Transformers · 25.71k downloads · 11 likes**

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between different modalities to achieve semantic alignment.
## LanguageBind Depth

**LanguageBind · MIT · Multimodal Alignment · Transformers · 898 downloads · 0 likes**

LanguageBind is a language-centric multimodal pretraining method that uses language as the bridge between modalities, achieving semantic alignment across video, infrared, depth, audio, and more.
## LanguageBind Video

**LanguageBind · MIT · Multimodal Alignment · Transformers · 166 downloads · 2 likes**

LanguageBind is a multimodal pretraining framework that extends video-language pretraining to N modalities through language-based semantic alignment; accepted at ICLR 2024.
## LanguageBind Thermal

**LanguageBind · MIT · Multimodal Alignment · Transformers · 887 downloads · 1 like**

LanguageBind is a pretraining framework that achieves multimodal semantic alignment with language as the bridge, supporting joint learning of modalities such as video, infrared, depth, and audio alongside language.
## TinySapBERT from TinyPubMedBERT v1.0

**dmis-lab · Large Language Model · Transformers · 16.93k downloads · 0 likes**

TinySapBERT is a compact biomedical entity representation model trained with the SapBERT framework, designed for biomedical named entity recognition tasks.
## DistilBERT Base Turkish Cased CLIP

**mys · Text-to-Image · Transformers · 2,354 downloads · 1 like**

A Turkish text encoder fine-tuned from dbmdz/distilbert-base-turkish-cased, designed to pair with CLIP's ViT-B/32 image encoder.